Skip to content

feat(conversations): server-authoritative resumable chat turns + tab rewrite#302

Open
Ovaculos wants to merge 21 commits into
NimbleBrainInc:mainfrom
Ovaculos:feat/conversations-tab-rework
Open

feat(conversations): server-authoritative resumable chat turns + tab rewrite#302
Ovaculos wants to merge 21 commits into
NimbleBrainInc:mainfrom
Ovaculos:feat/conversations-tab-rework

Conversation

@Ovaculos
Copy link
Copy Markdown
Contributor

@Ovaculos Ovaculos commented May 27, 2026

Summary

Rewrite the conversation tab around a server-authoritative resumable turn model (issue #254). A chat turn now runs to completion on the server regardless of the client's connection; the browser is a viewer that attaches to a per-conversation event log, replays the in-flight turn, and tails live events.

Closes #254. Closes #294 (Stop button + server-side run cancellation decoupled from transport). Includes the #253 follow-up (auto-title prompt) — upstream landed an XML-containment + sanitizer approach in #261 that is preserved here.

What this delivers

  • RunBus (src/runtime/run-bus.ts) — per-conversation in-memory log of buffered events with monotonic seq, ownership of the turn's AbortController, and a grace window for late re-attach. 500k per-run event cap with terminal error on overflow.
  • Detached startTurnPOST /v1/chat/start returns {conversationId} immediately; the turn keeps running across disconnect / refresh / tab switch. Stop button → POST /v1/conversations/:id/cancel.
  • Per-conversation SSE viewerGET /v1/conversations/:id/events replays from a client-supplied ?afterSeq=N then tails. Reconnect/backoff handled in conversation-stream.ts.
  • Per-conversation slice store (web/src/hooks/chat-store.ts) — useSyncExternalStore-backed module singleton, LRU cap, hasPendingTail for resume reconcile, probeConversation for sessionStorage-restored streaming dots.
  • Stage-1 conversation persistence at {workDir}/conversations/{id}.jsonl, user-scoped ownership, check:conversation-paths enforces it.
  • Conversations bundle UI: background-refresh list, in-place title update via a new synapse/conversation-title postMessage channel (no full list refetch on title resolve).

Known limitation: RunBus is single-process

Under platform.replicas > 1, a viewer landing on a different pod than the one that started the turn sees isActive:false and the live frames don't fan out cross-pod. Sticky routing on Mcp-Session-Id (existing prereq) mitigates the active-tab case; a pod restart or cross-pod viewer still drops resume mid-turn. A clustered Redis-backed RunBus is deferred — tracked in src/runtime/run-bus.ts. serve warns at boot when sessionStore.type === "redis" so the gap is visible.

Other fixes folded into this PR

  • Stop button now emits cancelled frame to live viewers (was: just aborted, viewer stuck isStreaming:true).
  • startTurn serializes create to prevent double-create race when client supplies an id.
  • ConversationCorruptedError → 422 in handleChatStart (was: 500).
  • Log sinks (structured-log-sink, workspace-log-sink) wrap writes in try/catch + warn-once-per-episode (was: throw into the event-emit path and crash detached turns).
  • Auto-title generates and live-updates via new conversation.title SSE event (with wsId scoping). List iframe receives via dedicated postMessage; chat panel header updates via existing chat-store hook.
  • Resume race: partial on-disk turn now completes from RunBus grace replay (no flicker). Three-way client reconcile (isActive / activeSeq>0 / GC'd).
  • Stuck streaming after >30s reconnect: server-authoritative isActive:false on subscribed frame drives chat-store reset.

Tests & hygiene

  • Runtime.start({}) now hard-errors under NODE_ENV=test when workDir is missing. Stops integration tests from silently writing echo-model conversations into the developer's real ~/.nimblebrain. Fixed 8 leakers + added test/helpers/test-workdir.ts.
  • scripts/check-code-style.ts skips node_modules/ and src/bundles/ per its doc comment (was: scanned vendored .d.ts after build:bundles, surfacing 1625 false-positive violations).
  • 20+ new tests covering chat-store slice semantics, partial-tail-completes-from-grace, complete-tail-no-flicker, RunBus overflow cap, etc.

Test plan

  • bun run verify — full CI parity, all green
  • Manual: start a turn, refresh mid-stream — assistant message reconstructs from disk + RunBus replay, no token loss
  • Manual: start a turn, switch to a different conversation mid-stream, switch back — no bleed, response still streaming on origin
  • Manual: start a turn, close the tab, reopen — title pops in live when auto-title resolves
  • Manual: Stop button — cancellation event arrives, streaming dot clears across all open tabs
  • Verify sessionStore.type: "redis" startup warning fires when configured

🤖 Generated with Claude Code

Ovaculos and others added 21 commits May 25, 2026 10:25
Decouple a chat turn's lifecycle from the originating HTTP request. A turn
now runs to completion on the server regardless of the client connection;
the client is a viewer that attaches to a replayable per-conversation event
log (RunBus), replays the in-flight turn, then tails live events. Refresh /
conversation-switch / disconnect never lose or duplicate work — they detach
and re-attach. Only the Stop button (RunBus.cancel) aborts generation.

- RunBus: in-memory, replayable, monotonic-seq turn log with grace-window
  retention for late re-attach; threads its own AbortSignal into the engine.
- runtime.startTurn: detached turn driver; seeds user.message into the log;
  reports isTurnActive / activeConversationIds / turnSeq / getTurnReplay.
- Per-conversation SSE: ConversationEventManager + /v1/conversations/:id/events
  route with subscribed{isActive} frame + buffered replay; /v1/chat/start and
  cancel endpoints.
- Broadcast conversation.title + data.changed on the global SSE when a title
  generates, so viewers update without a turn-stream connection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…imbleBrainInc#253)

The single-string transcript prompt made the fast model pattern-match a
conversation to continue and emit the start of the assistant's response as
the title (worst on creative/long answers). Use real role turns
(user → assistant → user-instruction); the trailing user turn is an
unambiguous instruction, not text to continue.

Closes NimbleBrainInc#253

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the single shared useChat with a per-conversation slice store. Each
conversation owns its own state; a turn's stream writes only into its origin
slice, so switching conversations mid-turn never bleeds the response into the
destination chat (NimbleBrainInc#254) and switching back shows it still arriving.

- chat-store: slice map (LRU-capped) backing useChat via useSyncExternalStore;
  sendTurn → /v1/chat/start then subscribe via conversation-stream (replay +
  live tail). Resume reflects server isActive (indicator/Stop survive reload),
  trims the stale in-flight turn from disk, and drops grace-buffer replay of a
  finished turn (no duplicate). AbortSignal threaded for cleanup only.
- Streaming dots: store streamingIds → hostContext (SlotRenderer) so the list
  shows live per-row activity; persisted per-tab in sessionStorage and re-probed
  on reload (dots survive refresh, self-heal when finished).
- Reload restore: last-viewed conversation reopened from sessionStorage.
- Live title: consume conversation.title SSE → slice title; ChatPanel header
  prefers the generated title. MessageList lands the view at the bottom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Per-row streaming indicator: read host-pushed streamingConversationIds
  (useHostContext) and render a pulsing dot on conversations with an in-flight
  turn.
- No more list flicker on updates: data.changed refreshes run in the
  background (no skeleton swap) and only for conversation changes — other apps'
  data.changed are ignored. Skeleton stays for initial load + view switches.
- New conversations appear immediately: ConversationIndex.flushPending() picks
  up a just-created file on the read path instead of racing the fs.watch
  debounce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-conversation viewer replaced the old streaming path, leaving the
client-side pieces unreferenced:

- client.ts: streamChat / streamChatMultipart / consumeSSEStream (callers
  now use startChatTurn + connectConversationStream).
- conversation-sse.ts and useConversationEvents.ts (the old cross-tab SSE
  hook) — zero importers.

The server `/v1/chat/stream` + handleChatStream stay (still a valid REST
surface, exercised by integration tests). Also comment the one ported
`as unknown as` cast for the done-payload `files` field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…prompt change

The capturing-model test helper skipped auto-title model calls by matching
the old prompt string ("Generate a 3-6 word title"). The NimbleBrainInc#253 rework changed
the title system prompt, so the guard stopped recognizing title calls and they
polluted the captured chat system prompt. Match the new prompt instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Detached turns emit events after the originating request ends; if the workdir
is gone (test teardown) or a write otherwise fails (disk full, perms), the
unguarded appendFileSync threw out of emit() as an unhandled error. Wrap the
write best-effort — a failed log line must never crash the caller.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RunBus.cancel() flips the run to terminal synchronously, after which publish()
is a no-op. The engine's post-abort `cancelled` publish (startTurn's catch)
therefore never reached the SSE feed (onTurnEvent), so the Stop button aborted
generation but left the UI stuck streaming until a reload. done/error work
because they publish before ending while the run is still active.

Publish `cancelled` in cancelTurn BEFORE RunBus.cancel ends the run. The
engine's later cancelled publish becomes a harmless no-op. Adds a regression
test on the live onTurnEvent path (the attach/onEnd path the old tests used
did not cover this).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
startTurn → store.load throws ConversationCorruptedError for a pre-migration
(ownerless) conversation on the resume path, but handleChatStart let it fall
through to a raw 500. handleChat and handleChatCancel both already map it to
422 via conversationCorruptedResponse — match them. Adds an HTTP test
asserting /v1/chat/start on an ownerless conversation returns 422.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two concurrent starts with the same not-yet-existing conversationId could both
load()->null then both create({id}) — and create is a truncating writeFile with
no exists-guard, so the loser would clobber the winner's in-progress file.
runBus.begin serializes the run, but it ran AFTER create, too late to help.

For a provided id, reserve the run (begin) BEFORE touching storage: the loser
now throws RunInProgressError before it can create. On any load/create failure
we evict to release the reservation so a failed start can't leave the id stuck
"running". Adds a regression test (delayed load forces the window; asserts
create runs exactly once).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
connectConversationStream reconnects with backoff up to 30s; RunBus GCs a
terminal turn 30s after it ends. If a viewer is disconnected past that window
while the turn finishes, the reconnect's subscribed{isActive:false} carries no
replayed terminal frame — and the client only reconciled on the first (resume)
subscribed, so the slice stayed isStreaming=true until a manual reload.

Reconcile on any subscribed frame: if the server reports no active turn while
the slice still thinks it's streaming, clear the streaming state (the server is
authoritative). A terminal frame still within grace replays right after and
finalizes the content; if GC'd, the partial is kept and a reload fetches the
final — either way the spinner no longer hangs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…icker)

loadConversation reads disk then subscribes. A short turn ending in that window
left a partial snapshot, and onSubscribed hit the drop branch — discarding the
grace-buffer replay that held the full `done`, so the final response only showed
after a reload.

Mark the gap precisely instead of always recombining (which would flicker every
recently-finished conversation on open): the reader flags an assistant turn with
no run.done/run.error as `pending`. On resume the client now branches on it:

- live turn, or pending tail + retained run (activeSeq>0) → trim + apply replay
  (rebuilds the full turn, no duplicate);
- pending tail + run GC'd → refetch the now-complete transcript;
- complete tail → ignore the grace replay (no dup, no flicker — the common
  open-a-recent-conversation path is untouched).

A trailing user message (no assistant yet) also counts as pending. probe slices
have no messages, so they're unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
conversation-stream viewers track the RunBus seq (id: line) for replay/resume.
The legacy /v1/chat and /v1/chat/stream endpoints fan out via
broadcastToConversation, which is seq-less — those frames apply live but don't
advance lastSeq and have no resume. The web shell is RunBus-only; only an
external caller hitting the legacy endpoints while a web tab watches the same
conversation would mix the two. Document the boundary on both sides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier hardening (commit 9ddfd2e) silenced log-sink throws to keep emit()
safe, but a real disk-full / perms incident produced zero operator signal.
Surface the first write failure of an episode with console.warn, then suppress
until a subsequent success re-arms — first failure visible, sustained outage
doesn't spam, intermittent recovery re-warns when it breaks again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A viewer landing on a different pod than the one running the turn sees
isActive:false and silently drops resume — RunBus is single-process and isn't
yet clustered. Sticky routing on Mcp-Session-Id (prereq NimbleBrainInc#2) mitigates for the
active tab; a pod restart or any cross-pod viewer still loses the in-flight
turn. Two changes so this isn't a silent regression:

- AGENTS.md: add a "Known limitation under replicas > 1" note under the four
  prereqs, pointing at run-bus.ts and the deferred clustered RunBus.
- serve.ts: emit a loud startup warn when sessionStore.type === "redis" (the
  one strong signal of multi-replica intent) so operators see the gap at boot.

Not a hard error — sticky routing is sufficient for many cases, and operators
may accept the limit until the clustered RunBus lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…owth

A runaway or adversarial event producer — a model stream looping
pathologically, a chatty tool spamming progress events — could push
log.events past safe memory limits before the grace-window GC fired.
Tens of thousands of token deltas held in-process for the grace window
on a multi-tenant deployment is real heap pressure.

Cap at 500k events per run with discard-as-error semantics: on overflow,
abort the turn's signal, append a synthetic terminal `error` event with
`error: "buffer_overflow"` so viewers see the cause, end the run as
`error`, and warn once. Subsequent publishes drop via the standard
`status !== "running"` guard.

500k sits ~10x over the worst legit run we can measure (extended-thinking
turn with chatty tool progress: ~30-50k events). Hitting it requires
the kind of pathological producer that is exactly what this cap exists
to terminate cleanly.

Configurable via the RunBus constructor for tests; production leaves it
at the default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`Runtime.start({})` defaulted `workDir` to `~/.nimblebrain`, so any
integration test that forgot the field silently wrote echo-model
conversations, test workspaces, and bundle data straight into the
developer's real workdir. They then showed up in the conversations tab
on `bun run dev` — confusing diagnosis that looked like a title-pipeline
bug but was really 268 of 282 convs being test fixtures with
`workspaceId: "ws_test"`.

`resolveWorkDir` now throws under `NODE_ENV=test` (bun:test sets this
automatically) when `workDir` is missing, with a clear example fix.
Catches future regressions of this category at construction time.

Eight test files were leaking:
- chat-stream-concurrent
- appcontext-wiring (3 Runtime.starts)
- dependency-checking (6 starts; share file-scope testDir)
- conversation-integration
- remote-integration (3 starts)
- runtime (7 starts; one needed its own workDir to avoid skill bleed
  from earlier same-file tests writing to the shared global skill dir)
- security/workspace-isolation
- usage-integration

New helper `test/helpers/test-workdir.ts` exposes `makeTestWorkDir(label)`
that returns `{workDir, cleanup}` for tests that want per-test isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime fired `emitConversationsChanged()` three times on a
first-turn chat: when isNew (sidebar row appears), at .finally (terminal
state), and again when the title resolved. Each broadcast triggered the
conversations-list iframe to refetch the whole list.

The title-resolve refetch is now redundant. The `conversation.title`
SSE event already exists (the chat panel header consumes it for live
title display); the list iframe just didn't have a channel for it.
Add one: the web shell forwards `conversation.title` via postMessage to
matching iframes, and the conversations bundle listens with a raw
`window.addEventListener` (the synapse SDK doesn't know this envelope,
which is fine — its own listener ignores unrecognized methods, no
double-handling). The iframe patches the matching row's title in-place,
no list refetch.

Drop the now-redundant `emitConversationsChanged()` from the
title-resolve handler. Broadcast count drops from 3 to 2 on first turn;
subsequent turns unchanged at 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dle src

Two unrelated cleanups:

- `web/src/api/client.ts`: comments still referenced the deleted
  `streamChat` / `streamChatMultipart` helpers (uploadResource header
  pattern note + startChatTurn docblock). Rot-flagged only; no
  behavior change.

- `scripts/check-code-style.ts`: the doc comment says "bundles are out
  of scope" but the glob walked `src/**/*.ts`, which after
  `bun run build:bundles` pulls in `src/bundles/*/ui/node_modules/`
  and surfaces 1625 false-positive inline-type-import violations in
  vendored `.d.ts` files. Skip `node_modules/` and `src/bundles/`
  per the documented scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tab-rework

# Conflicts:
#	src/api/routes/chat.ts
#	src/conversation/auto-title.ts
#	web/src/api/client.ts
#	web/src/api/conversation-sse.ts
#	web/src/components/AppWithChat.tsx
#	web/src/components/ChatPanel.tsx
#	web/src/context/ChatContext.tsx
#	web/src/hooks/useChat.ts
#	web/test/inlineError.test.tsx
#	web/test/streamingState.test.tsx
…tab-rework

Conflicts resolved in:
- web/test/inlineError.test.tsx
- web/test/streamingState.test.tsx

Adopted upstream's realClient snapshot pattern from web/test/setup.ts
(safer than `await import("../src/api/client")` per the setup.ts comment)
while preserving the branch's server-authoritative mocks
(startChatTurn / startChatTurnMultipart / cancelChatTurn +
connectConversationStream).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant